During this short tutorial we will go through a sample workflow with the DESeq2 R package (see Anders and Huber (2010)).
We will only go through all the standard functions and analyses of DESeq in this practical.
DESeq2 Analysis
Loss of function of myosin chaperones triggers Hsf1-mediated transcriptional response in skeletal muscle cells
Christelle Etard, Olivier Armant, Urmas Roostalu, Victor Gourain, Marco Ferg and Uwe Strähle
Genome Biology 2015 16:267 https://doi.org/10.1186/s13059-015-0825-8
RNA-seq_Strahle_Lab_0005AS.<SequencingID>.USERvgourain.R.ReadsPerGene.out.tab
| Hpf | wt | unc45b |
|---|---|---|
| 24 | DCD001548SQ | DCD001560SQ |
| DCD001559SQ | DCD001554SQ | |
| 48 | DCD001546SQ | DCD001564SQ |
| DCD001558SQ | DCD001555SQ | |
| 72 | DCD001547SQ | DCD001565SQ |
| DCD001545SQ | DCD001551SQ |
All the libraries were unstranded paired-ended sequenced on Illumina HiSeq 2000 producing 50bp long reads.
Generate count tables of reads per gene
example of ReadsPerGene.out.tab table:
$ head RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.tab
N_unmapped 1100914 1100914 1100914
N_multimapping 11275176 11275176 11275176
N_noFeature 6090090 14560190 14670407
N_ambiguous 478755 112331 109422
ENSDARG00000104632 26 14 13
ENSDARG00000100660 42 19 27
ENSDARG00000098417 10 8 3
ENSDARG00000100422 2406 1201 1219
ENSDARG00000102128 250 133 118
ENSDARG00000103095 621 318 307
select columns (gene IDs, unstranded reads) and ignore the first 4 lines
$ tail +5 RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.tab | awk '{print $1 "\t" $2}' > RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.DESeq.tab
$ tail +5 RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.ReadsPerGene.out.tab | awk '{print $1 "\t" $2}' > RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.ReadsPerGene.out.DESeq.tab
$ tail +5 RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.ReadsPerGene.out.tab | awk '{print $1 "\t" $2}' > RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.ReadsPerGene.out.DESeq.tab
$ tail +5 RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.ReadsPerGene.out.tab | awk '{print $1 "\t" $2}' > RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.ReadsPerGene.out.DESeq.tab
$ htseq-count [options] <alignment_files> <gff_file>
Options:
-s, whether the data is from a strand-specific assay
-r, the alignment have to be sorted either by read name or by alignment position
-f, input format (SAM/BAM)
First you will want to specify a variable which points to the directory in which the htseq-count output files are located.
library("DESeq2")
directory <- "../data/provided_files/data_DESeq/"
We specify which files to read in using list.files, and select those files contain the extension “.tab” using grep. Then we set the sample conditions for downstream analyses. Finally, we make a dataframe containing this information.
sampleFiles <- grep("*.tab", list.files(directory), value = TRUE)
sampleCondition <- c("wt", "unc45b", "wt", "unc45b")
sampleTable <- data.frame(sampleName = sampleFiles, fileName = sampleFiles,
condition = sampleCondition)
# show the sampleTable
sampleTable
## sampleName
## 1 RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.ReadsPerGene.out.tab
## 2 RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.tab
## 3 RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.ReadsPerGene.out.tab
## 4 RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.ReadsPerGene.out.tab
## fileName
## 1 RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.ReadsPerGene.out.tab
## 2 RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.tab
## 3 RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.ReadsPerGene.out.tab
## 4 RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.ReadsPerGene.out.tab
## condition
## 1 wt
## 2 unc45b
## 3 wt
## 4 unc45b
Then we build the DESeqDataSet using the following function as we have htseq-count files:
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory,
design = ~condition)
The standard differential expression analysis steps are wrapped into a single function, DESeq.
dds <- DESeq(ddsHTSeq)
normalised.table <- (counts(dds, normalized = T))
write.csv(normalised.table, "../data/results/DESeq2-wt_vs_unc45b-normalised.counts.csv")
The size factors are accessible via sizeFactors:
sizeFactors(dds)
## RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.ReadsPerGene.out.tab
## 1.2696225
## RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.tab
## 0.7059657
## RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.ReadsPerGene.out.tab
## 0.9633211
## RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.ReadsPerGene.out.tab
## 1.1855957
For every gene, a Negative Binomial (NB) distribution is fitted based on the counts to estimate the dispersion in 3 steps:
estimates dispersion parameter for each gene
plots and fits a curve
adjusts the dispersion parameter towards the curve
Legend:
plotDispEsts(dds)
Results tables are generated using the function results(), which extracts a results table with log2 fold-changes, p-values and adjusted p-values. With no additional arguments to results, the log2 fold-change and Wald test p-value will be for the last variable in the design formula, and if this is a factor, the comparison will be the last level of this variable over the first level (unc45b over wt).
Details about the comparison are printed to the console, above the results table. The text, condition treated vs untreated, tells you that the estimates are of the logarithmic fold-change log2 (wt vs unc45b).
# get results
res <- results(dds)
# order for easy view:
res.ord <- res[order(res$padj), ]
# let's check the top differentially expressed genes!
head(res.ord)
## log2 fold change (MLE): condition wt vs unc45b
## Wald test p-value: condition wt vs unc45b
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat
## <numeric> <numeric> <numeric> <numeric>
## ENSDARG00000037403 14439.485 -8.296490 0.1813989 -45.73616
## ENSDARG00000056210 6310.683 -7.550094 0.2408243 -31.35105
## ENSDARG00000094719 2027.783 -5.489757 0.1922252 -28.55899
## ENSDARG00000012381 8790.944 -3.916069 0.1499467 -26.11640
## ENSDARG00000055723 4247.576 -5.538385 0.2152812 -25.72629
## ENSDARG00000041065 11004.821 -5.144954 0.2059325 -24.98369
## pvalue padj
## <numeric> <numeric>
## ENSDARG00000037403 0.000000e+00 0.000000e+00
## ENSDARG00000056210 9.414973e-216 8.257873e-212
## ENSDARG00000094719 2.172252e-179 1.270188e-175
## ENSDARG00000012381 2.374718e-150 1.041432e-146
## ENSDARG00000055723 5.940092e-146 2.084022e-142
## ENSDARG00000041065 9.195203e-138 2.688371e-134
We can summarize some basic tallies using the summary function.
summary(res)
##
## out of 28527 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 132, 0.46%
## LFC < 0 (down) : 302, 1.1%
## outliers [1] : 0, 0%
## low counts [2] : 10985, 39%
## (mean count < 34)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
How many adjusted p-values were less than 0.05?
sum(res$padj < 0.05, na.rm = TRUE)
## [1] 337
The results function contains a number of arguments to customize the results table which is generated. You can read about these arguments by looking up ?results. Note that the results function automatically performs independent filtering based on the mean of normalised counts for each gene, optimizing the number of genes which will have an adjusted p-value below a given FDR cutoff, alpha. Independent filtering is further discussed below. By default the argument alpha is set to 0.1. If the adjusted p-value cutoff will be a value other than 0.1, alpha should be set to that value: here that would be 0.05.
res05 <- results(dds, alpha = 0.05)
res05 <- na.omit(res05) #remove NA values
summary(res05)
##
## out of 16993 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 93, 0.55%
## LFC < 0 (down) : 249, 1.5%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
## (mean count < 42)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Export results (FDR 0.05)
write.csv(as.data.frame(res05), file = "../data/results/DESeq2-wt_vs_unc45b-FDR05.csv")
In DESeq2, the function plotMA shows the log2 fold-changes attributable to a given variable over the mean of normalised counts for all the samples in the DESeqDataSet. Points will be colored red if the adjusted p value is less than 0.1. Points which fall out of the window are plotted as open triangles pointing either up or down.
plotMA(dds, alpha = 0.05, ylim = c(-5, 5), main = "DESeq2-MAplot: WT vs unc45b")
You can plot single genes with the plotCounts() function. As long as you know the row number of the gene of interest. So when we want to look at the most significant differentially expressed gene:
plotCounts(dds, gene = which.min(res$padj), intgroup = "condition")
Or just your favourite gene PARP12:
plotCounts(dds, gene = which(rownames(res) == "ENSDARG00000100660"), intgroup = "condition")
In order to test for differential expression, we operate on raw counts and use discrete distributions as described in the previous section on differential expression. However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data.
Maybe the most obvious choice of transformation is the logarithm.
One makes use of the concept of variance stabilizing transformations (VST) (Tibshirani (1988); Huber et al. (2003); Anders and Huber (2010)), and the other is the regularized logarithm or rlog, which incorporates a prior on the sample differences (Love, Huber, and Anders (2014)). Both transformations produce transformed data on the log2 scale which has been normalised with respect to library size or other normalisation factors.
The assay function is used to extract the matrix of normalised values.
rld <- rlogTransformation(dds, blind = TRUE)
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)
head(assay(vsd), 3)
## RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632 7.142512
## ENSDARG00000100660 7.489663
## ENSDARG00000098417 6.897470
## RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632 6.930279
## ENSDARG00000100660 7.220474
## ENSDARG00000098417 6.504708
## RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632 6.841847
## ENSDARG00000100660 7.466815
## ENSDARG00000098417 6.953443
## RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632 6.934472
## ENSDARG00000100660 7.234222
## ENSDARG00000098417 7.146909
head(assay(rld), 3)
## RNA-seq_Strahle_Lab_0005AS.DCD001546SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632 5.396667
## ENSDARG00000100660 6.260150
## ENSDARG00000098417 5.121452
## RNA-seq_Strahle_Lab_0005AS.DCD001555SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632 5.281471
## ENSDARG00000100660 6.105386
## ENSDARG00000098417 4.963018
## RNA-seq_Strahle_Lab_0005AS.DCD001558SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632 5.236849
## ENSDARG00000100660 6.244054
## ENSDARG00000098417 5.149828
## RNA-seq_Strahle_Lab_0005AS.DCD001564SQ.USERvgourain.R.ReadsPerGene.out.tab
## ENSDARG00000104632 5.281899
## ENSDARG00000100660 6.107221
## ENSDARG00000098417 5.257630
Another use of the transformed data is sample clustering. Here, we apply the dist function to the transpose of the transformed count matrix to get sample-to-sample distances and plot in a heatmap.
library("RColorBrewer")
library("gplots")
distsRL <- dist(t(assay(rld)))
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
mat <- as.matrix(distsRL)
rownames(mat) <- colnames(mat) <- with(colData(dds), paste(condition, sep = " : "))
heatmap.2(mat, trace = "none", col = rev(hmcol), margin = c(13, 13), cexRow = 0.9,
cexCol = 0.9)
Related to the distance matrix is the PCA plot, which shows the samples in the 2D plane spanned by their first two principal components. This type of plot is useful for visualizing the overall effect of experimental covariates and batch effects.
print(plotPCA(rld, intgroup = c("condition")))
print(sessionInfo())
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] gplots_3.0.1 RColorBrewer_1.1-2
## [3] DESeq2_1.18.1 SummarizedExperiment_1.8.0
## [5] DelayedArray_0.4.1 matrixStats_0.52.2
## [7] Biobase_2.38.0 GenomicRanges_1.30.0
## [9] GenomeInfoDb_1.14.0 IRanges_2.12.0
## [11] S4Vectors_0.16.0 BiocGenerics_0.24.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.14 locfit_1.5-9.1
## [3] lattice_0.20-35 gtools_3.5.0
## [5] rprojroot_1.2 digest_0.6.12
## [7] plyr_1.8.4 backports_1.1.1
## [9] acepack_1.4.1 RSQLite_2.0
## [11] evaluate_0.10.1 ggplot2_2.2.1
## [13] zlibbioc_1.24.0 rlang_0.1.4
## [15] lazyeval_0.2.1 gdata_2.18.0
## [17] data.table_1.10.4-3 annotate_1.56.1
## [19] blob_1.1.0 rpart_4.1-11
## [21] Matrix_1.2-12 checkmate_1.8.5
## [23] rmarkdown_1.8 labeling_0.3
## [25] splines_3.4.2 BiocParallel_1.12.0
## [27] geneplotter_1.56.0 stringr_1.2.0
## [29] foreign_0.8-69 htmlwidgets_0.9
## [31] RCurl_1.95-4.8 bit_1.1-12
## [33] munsell_0.4.3 compiler_3.4.2
## [35] base64enc_0.1-3 htmltools_0.3.6
## [37] nnet_7.3-12 tibble_1.3.4
## [39] gridExtra_2.3 htmlTable_1.9
## [41] GenomeInfoDbData_0.99.1 Hmisc_4.0-3
## [43] XML_3.98-1.9 bitops_1.0-6
## [45] grid_3.4.2 xtable_1.8-2
## [47] gtable_0.2.0 DBI_0.7
## [49] magrittr_1.5 formatR_1.5
## [51] scales_0.5.0 KernSmooth_2.23-15
## [53] stringi_1.1.6 XVector_0.18.0
## [55] genefilter_1.60.0 latticeExtra_0.6-28
## [57] Formula_1.2-2 tools_3.4.2
## [59] bit64_0.9-7 survival_2.41-3
## [61] yaml_2.1.14 AnnotationDbi_1.40.0
## [63] colorspace_1.3-2 cluster_2.0.6
## [65] caTools_1.17.1 memoise_1.1.0
## [67] knitr_1.17
Anders, Simon, and Wolfgang Huber. 2010. “Differential expression analysis for sequence count data.” Genome Biology 11 (10). BioMed Central: R106. doi:10.1186/gb-2010-11-10-r106.
Huber, Wolfgang, Anja von Heydebreck, Holger Sueltmann, Annemarie Poustka, and Martin Vingron. 2003. “Parameter estimation for the calibration and variance stabilization of microarray data.” Statistical Applications in Genetics and Molecular Biology 2 (1): Article3. doi:10.2202/1544-6115.1008.
Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12): 550. doi:10.1186/s13059-014-0550-8.
Tibshirani, Robert. 1988. “Estimating Transformations for Regression Via Additivity and Variance Stabilization.” Journal of the American Statistical Association 83 (402). Taylor & Francis, Ltd.American Statistical Association: 394. doi:10.2307/2288855.